The following packages are used:

library(dplyr)
library(ggplot2)
library(jsonlite)
library(knitr)
library(readr)
library(stringr)
library(tidyr)
library(zoo)

Introduction

Hearthstone is a popular collectible card game published by Blizzard Entertainment in 2014 and based on the company's Warcraft series. Each player builds a deck of 30 cards and wins by defeating the opponent's hero or by getting the opponent to concede.

Cards can be classified according to the following categories:

  • Class: Whether the card can be used by only one of the 9 classes in the game, or by all of them (Neutral)
  • Rarity: Free, Common, Rare, Epic, Legendary
  • Type: Minion, Spell, Weapon, Hero
  • Set: Besides the Basic and Classic sets, additional cards are added to the game through expansions; a card's set also determines whether it is eligible for the Standard format.

Datasets Used

There are three datasets:

  • data.csv contains a list of decks submitted by players to HearthPwn from 2013 (pre-launch) to 2017.
  • refs.json contains detailed information about all cards (both collectible and non-collectible) up to March 2017.
  • cards_collectible.json contains detailed information about all the cards that are collectible in the game.

Questions

  1. Which are the most popular cards used in Ranked decks?

Breaking down the question

Which are the most popular cards used in Ranked decks?

We focus on the Ranked format, where players decide which cards to include in their decks. Card popularity is therefore represented more accurately, and the gameplay is not subject to the additional constraints that other game modes (like Tavern Brawls and Adventures) may impose.

How do we determine popularity?

We measure a card's popularity by the number of decks that include at least 1 copy of it among the starting 30 cards (copies generated by other effects during a game do not count).

A deck can include at most 2 copies of any card (1 for Legendary cards), thus a card’s popularity is not heavily influenced by the number of copies players wish to use.
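
The counting rule above can be sketched on a toy table. This is only a minimal illustration, assuming the deck composition is already in long format (one row per card slot); toy_long, card_popularity and the IDs in them are made up for the example and are not taken from the dataset.

```r
library(dplyr)

# Hypothetical long-format deck composition: one row per card slot.
# Deck 1 runs two copies of card 101, deck 2 runs one copy.
toy_long <- tibble(
    deck_id = c(1, 1, 1, 2, 2, 3),
    dbfId   = c(101, 101, 202, 101, 303, 202)
)

card_popularity <- toy_long %>%
    distinct(deck_id, dbfId) %>%          # a second copy in the same deck is ignored
    count(dbfId, name = "n_decks") %>%    # number of decks containing the card
    arrange(desc(n_decks), dbfId)

card_popularity
```

Because duplicates within a deck are dropped before counting, card 101 is counted once for deck 1 even though that deck holds two copies.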

Possible biases to consider

Since Neutral cards can be used by every class, we expect them to appear in more decks than Class-specific cards.

For the Wild format, cards from the older sets may be more popular simply because they have been in the game longer.

For the Standard format, cards from the Basic and Classic sets will be more popular because they do not rotate out of the format unlike expansion cards.

Why address such a question?

If a certain card becomes too popular (i.e. the community thinks players must include it in their decks), it reduces the card variety in the metagame and makes gameplay frustrating for other players (amongst other consequences). In the long term, this may lead to player attrition and loss of potential revenue (when players purchase card packs or other cosmetics).

Historically, Blizzard has dealt with problematic cards in one of several ways:

Loading the data

data_path <- file.path("..", "data")

decks_raw <- read_csv(file.path(data_path, "data.csv"))
cards_raw <- fromJSON(file.path(data_path, "cards_collectible.json"), flatten = TRUE)

Decks data

The decks_raw data has 346232 rows and 41 columns. The first 11 columns describe the deck’s attributes (like date submitted, class, deck format) while the remaining 30 columns describe the cards each deck contains (based on the card’s unique ID which can be referenced from the cards_raw data).

Detailed information on the variables can be found on the dataset's Kaggle page.

names(decks_raw)
##  [1] "craft_cost"     "date"           "deck_archetype" "deck_class"    
##  [5] "deck_format"    "deck_id"        "deck_set"       "deck_type"     
##  [9] "rating"         "title"          "user"           "card_0"        
## [13] "card_1"         "card_2"         "card_3"         "card_4"        
## [17] "card_5"         "card_6"         "card_7"         "card_8"        
## [21] "card_9"         "card_10"        "card_11"        "card_12"       
## [25] "card_13"        "card_14"        "card_15"        "card_16"       
## [29] "card_17"        "card_18"        "card_19"        "card_20"       
## [33] "card_21"        "card_22"        "card_23"        "card_24"       
## [37] "card_25"        "card_26"        "card_27"        "card_28"       
## [41] "card_29"
# code is not run
glimpse(decks_raw) 

There are 8 rows that contain missing data. All of the missing values are in the 10th column (title), which holds the deck title as submitted by the user, so they can be safely ignored.

sum(!complete.cases(decks_raw))
## [1] 8
which(is.na(decks_raw), arr.ind = TRUE)
##         row col
## [1,]  16747  10
## [2,] 175608  10
## [3,] 216047  10
## [4,] 238021  10
## [5,] 278491  10
## [6,] 326192  10
## [7,] 329285  10
## [8,] 329286  10

Cards data

Additional information on the variables can be found on HearthstoneJSON.

There are two identifier fields for the cards: a character/string id and an integer dbfId. The decks_raw dataset uses the integer IDs to reference cards used.

dim(cards_raw)
## [1] 1751   65
# code is not run
names(cards_raw)
# code is not run
glimpse(cards_raw)

The following computes the number of missing values in each field, skipping the fields that are stored as lists or data frames (mechanics, referencedTags, classes, entourage).

# code is not run
# keep only atomic columns (list and data frame columns are skipped)
atomic_cols <- !sapply(cards_raw, is.list)
sapply(cards_raw[atomic_cols], function(x) sum(is.na(x)))

Pre-processing

Decks data

The raw dataset is split into two, one containing the deck attributes and the other containing the deck composition (cards), with deck_id acting as the unique identifier. We also exclude decks created before launch (there were many card changes in the alpha and beta stages, making card popularity very volatile).

The decks_comp data will be pivoted to long format later on, thus excluding fields that are not related to the cards will minimize the size of the dataset.

launch_date <- as.Date("2014-03-11")

decks_attr <- decks_raw %>% 
    filter(date >= launch_date) %>% 
    select(deck_id, craft_cost:deck_format, deck_set:user)

decks_comp <- decks_raw %>% 
    filter(date >= launch_date) %>% 
    select(deck_id, card_0:card_29)
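
The pivot to long format mentioned above might look like the following. This is a sketch on a made-up table with three card slots instead of thirty; toy_comp, the slot column name and the use of starts_with() are assumptions for the example, not the author's code.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per deck, one column per card slot
toy_comp <- tibble(
    deck_id = c(1, 2),
    card_0  = c(101, 101),
    card_1  = c(101, 303),
    card_2  = c(202, 404)
)

# One row per (deck, slot) pair; the card ID moves into a single column
toy_comp_long <- toy_comp %>%
    pivot_longer(cols = starts_with("card_"),
                 names_to = "slot",
                 values_to = "dbfId")

toy_comp_long
```

Each deck contributes one row per slot, so the real decks_comp table would grow to 30 rows per deck, which is why unrelated columns are dropped beforehand.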

Within decks_attr, the factor/enumerated columns are identified and recast accordingly.

fct_cols_attr <- c("deck_archetype", "deck_class", "deck_format", "deck_set", "deck_type")

decks_attr[fct_cols_attr] <- lapply(decks_attr[fct_cols_attr], factor)
rm(fct_cols_attr)

Years and Months

While each deck has a submission date, we may also be interested in grouping the decks by month (which corresponds to Ranked seasons) and by year (which is marked by expansion release dates instead of calendar dates).

Years in the game are based on a time period that:

  • Starts with the release of the first card set of each year, which usually falls around April.
  • Ends with the release of the first card set the next year (non-inclusive).

So based on the release dates, the years would be:

  • 2014-03-11 to 2015-04-01 (Live, Naxxramas, Goblins vs Gnomes)
  • 2015-04-02 to 2016-04-25 (Blackrock, Grand Tournament, League of Explorers)
  • 2016-04-26 to 2017-03-19 (Old Gods, Karazhan, Gadgetzan)

# month (with year)
decks_attr$hsmonth <- as.yearmon(decks_attr$date)

# year (not by calendar)
decks_attr$hsyear <- case_when(
        decks_attr$date <= as.Date("2015-04-01") ~ "2014",
        decks_attr$date >= as.Date("2015-04-02") & 
            decks_attr$date <= as.Date("2016-04-25") ~ "2015",
        decks_attr$date >= as.Date("2016-04-26") ~ "2016"
    ) %>% 
        factor()
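
As a cross-check on the boundary dates, the same rule can be written as an interval lookup on the start date of each year. This is an alternative sketch in base R, not the method used above; hs_year, year_starts and year_labels are hypothetical names. Unlike the case_when() version, it returns NA for dates before launch instead of labelling them "2014".

```r
# Start date of each Hearthstone year, taken from the list above
year_starts <- as.Date(c("2014-03-11", "2015-04-02", "2016-04-26"))
year_labels <- c("2014", "2015", "2016")

# Hypothetical helper: map submission dates to a Hearthstone year
hs_year <- function(dates) {
    # index of the last start date that is on or before each date (0 if none)
    idx <- findInterval(as.numeric(dates), as.numeric(year_starts))
    idx[idx == 0] <- NA   # dates before launch have no Hearthstone year
    factor(year_labels[idx], levels = year_labels)
}

# Dates on either side of each boundary
boundary_dates <- as.Date(c("2014-03-10", "2015-04-01", "2015-04-02",
                            "2016-04-25", "2016-04-26"))
as.character(hs_year(boundary_dates))
# NA "2014" "2015" "2015" "2016"
```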

Deck Format

The Standard and Wild formats were formally introduced into the game on 2016-04-26 with the release of Whispers of the Old Gods; however, the graphic below shows that many decks from June 2014 to April 2016 were marked as Wild.

Since cards were not separated by format before that date, we can simply relabel all decks created before 2016-04-26 as Standard:

decks_attr$deck_format[decks_attr$date < as.Date("2016-04-26")] <- "S"

A summary of the processed data is shown below:

##     deck_id         craft_cost         date           
##  Min.   : 36923   Min.   :    0   Min.   :2014-03-11  
##  1st Qu.:253573   1st Qu.: 2840   1st Qu.:2015-05-26  
##  Median :428597   Median : 5120   Median :2016-02-09  
##  Mean   :419989   Mean   : 5745   Mean   :2015-12-21  
##  3rd Qu.:603508   3rd Qu.: 7840   3rd Qu.:2016-08-09  
##  Max.   :749548   Max.   :48000   Max.   :2017-03-19  
##                                                       
##          deck_archetype     deck_class    deck_format
##  Unknown        :220501   Mage   :42230   S:307743   
##  Midrange Shaman:  5472   Priest :41756   W: 16361   
##  Control Priest :  5135   Paladin:39368              
##  Control Warrior:  4939   Warlock:35598              
##  Tempo Mage     :  4545   Druid  :35488              
##  Midrange Hunter:  4371   Shaman :33969              
##  (Other)        : 79141   (Other):95695              
##              deck_set              deck_type          rating        
##  Explorers       : 57307   Arena        :  8178   Min.   :   0.000  
##  Old Gods        : 49895   None         : 75120   1st Qu.:   1.000  
##  Blackrock Launch: 38900   PvE Adventure:  9059   Median :   1.000  
##  Gadgetzan       : 31329   Ranked Deck  :202104   Mean   :   2.777  
##  Naxx Launch     : 22283   Tavern Brawl :  6360   3rd Qu.:   1.000  
##  Yogg Nerf       : 22175   Theorycraft  : 19686   Max.   :4016.000  
##  (Other)         :102215   Tournament   :  3597                     
##     title               user              hsmonth      hsyear      
##  Length:324104      Length:324104      Min.   :2014   2014: 65119  
##  Class :character   Class :character   1st Qu.:2015   2015:128062  
##  Mode  :character   Mode  :character   Median :2016   2016:130923  
##                                        Mean   :2016                
##                                        3rd Qu.:2017                
##                                        Max.   :2017                
## 

Cards data

Many columns in the cards_raw data pertain to the card's stats, mechanics and play requirements, which are better explained by the text on the card image. So we include only the columns that we consider core properties of the card (and that are not sufficiently explained by the card text):

cards_simple <- select(cards_raw,
                       dbfId, name, cost, cardClass,
                       rarity, type, set, collectible, id)
## Observations: 1,751
## Variables: 9
## $ dbfId       <int> 2539, 2541, 2545, 2572, 2542, 2549, 2571, 2544, 25...
## $ name        <chr> "Flame Lance", "Effigy", "Fallen Hero", "Arcane Bl...
## $ cost        <int> 5, 3, 2, 1, 3, 4, 3, 6, 8, 5, 4, 4, 1, 3, 2, 2, 4,...
## $ cardClass   <chr> "MAGE", "MAGE", "MAGE", "MAGE", "MAGE", "MAGE", "M...
## $ rarity      <chr> "COMMON", "RARE", "RARE", "EPIC", "RARE", "COMMON"...
## $ type        <chr> "SPELL", "SPELL", "MINION", "SPELL", "SPELL", "MIN...
## $ set         <chr> "TGT", "TGT", "TGT", "TGT", "TGT", "TGT", "TGT", "...
## $ collectible <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
## $ id          <chr> "AT_001", "AT_002", "AT_003", "AT_004", "AT_005", ...

The set column contains abbreviated names or nicknames and is not necessarily informative; we create a new column that uses the actual names of the card sets:

unique(cards_simple$set)
##  [1] "TGT"          "BOOMSDAY"     "BRM"          "GANGS"       
##  [5] "CORE"         "EXPERT1"      "HOF"          "NAXX"        
##  [9] "GILNEAS"      "GVG"          "HERO_SKINS"   "ICECROWN"    
## [13] "KARA"         "LOE"          "LOOTAPALOOZA" "OG"          
## [17] "UNGORO"
# list for recoding card sets
cardset_lst <- list(
    "CORE" = "Basic",
    "EXPERT1" = "Classic",
    # 2014 sets
    "NAXX" = "Curse of Naxxramas",
    "GVG" = "Goblins vs Gnomes",
    # 2015 sets
    "BRM" = "Blackrock Mountain",
    "TGT" = "The Grand Tournament",
    "LOE" = "League of Explorers",
    # 2016 sets
    "OG" = "Whispers of the Old Gods",
    "KARA" = "One Night in Karazhan",
    "GANGS" = "Mean Streets of Gadgetzan",
    "HOF" = "Hall of Fame",
    # 2017 sets
    "UNGORO" = "Journey to Un'Goro",
    "ICECROWN" = "Knights of the Frozen Throne",
    "LOOTAPALOOZA" = "Kobolds & Catacombs",
    # 2018 sets
    "GILNEAS" = "The Witchwood",
    "BOOMSDAY" = "The Boomsday Project"
)

cards_simple$card_set <- recode(cards_simple$set, !!!cardset_lst,
                                .default = "Other")

# drop the set column to avoid confusion (it remains in the cards_raw dataset)
cards_simple$set <- NULL
# remove list (no longer needed)
rm(cardset_lst)

Some columns are entirely uppercase, which we convert to title case for readability:

titlecase_cols <- c("cardClass", "rarity", "type")
cards_simple[titlecase_cols] <- lapply(cards_simple[titlecase_cols], 
                                       str_to_title)
rm(titlecase_cols)

The factor/enumerated columns are then identified and recast accordingly.

fct_cols_cards <- c("cardClass", "rarity", "type", "card_set")

cards_simple[fct_cols_cards] <- lapply(cards_simple[fct_cols_cards], factor)
rm(fct_cols_cards)

A summary of the processed data is shown below:

##      dbfId           name                cost          cardClass  
##  Min.   :    7   Length:1751        Min.   : 0.000   Neutral:657  
##  1st Qu.: 1987   Class :character   1st Qu.: 2.000   Paladin:123  
##  Median :38957   Mode  :character   Median : 4.000   Hunter :122  
##  Mean   :25375                      Mean   : 3.856   Mage   :122  
##  3rd Qu.:43163                      3rd Qu.: 5.000   Warlock:122  
##  Max.   :53187                      Max.   :20.000   Druid  :121  
##                                     NA's   :22       (Other):484  
##        rarity        type      collectible         id           
##  Common   :612   Hero  :  33   Mode:logical   Length:1751       
##  Epic     :298   Minion:1192   TRUE:1751      Class :character  
##  Free     :142   Spell : 471                  Mode  :character  
##  Legendary:253   Weapon:  55                                    
##  Rare     :446                                                  
##                                                                 
##                                                                 
##                          card_set  
##  Classic                     :236  
##  Basic                       :142  
##  Journey to Un'Goro          :135  
##  Knights of the Frozen Throne:135  
##  Kobolds & Catacombs         :135  
##  The Boomsday Project        :135  
##  (Other)                     :833

Mislabelled cards

As the decks are generated by human input, and several cards share the same name, it is worth checking for decks that reference a card with the right name but the wrong ID.

Specifically, we are looking for the version of each card that is collectible (since all cards used in Ranked decks must be collectible). This step requires loading the full card data (which contains non-collectible cards).

# IDs of all cards used
cards_used  <- decks_comp %>%
        select(card_0:card_29) %>%
        unlist(use.names = FALSE) %>%  # flatten into a vector
        unique() %>%
        sort()
# IDs of missing cards (using dbfId)
missing_cards <- cards_used[!cards_used %in% cards_simple$dbfId]

# how many cards are not found in the collectible cards dataset?
length(missing_cards)
## [1] 15
# data for all cards
cards_all_raw <- fromJSON(file.path(data_path, "refs.json"), flatten = TRUE)

# filter for those missing cards only
mssng_cards <- cards_all_raw %>% 
    select(dbfId, name, cardClass, type, collectible) %>% 
    filter(dbfId %in% missing_cards)

To find the correct IDs, we join the missing cards by name to cards_simple. In the result, dbfId.x (from the left table) is the mislabelled ID and dbfId.y (from the right table) is the correct ID that should replace it:

mislabelled <- mssng_cards %>% 
    select(dbfId, name) %>% 
    # also drops uncollectible mislabelled cards (which are not in cards_simple)
    inner_join(cards_simple, by = "name") %>% 
    arrange(name)
mislabelled
dbfId.x  name                dbfId.y  cost  cardClass  rarity     type    collectible  id        card_set
40341    Cleave              940      2     Warrior    Free       Spell   TRUE         CS2_114   Basic
2177     Dark Wispers        2009     6     Druid      Epic       Spell   TRUE         GVG_041   Goblins vs Gnomes
42146    Doppelgangster      40953    5     Neutral    Rare       Minion  TRUE         CFM_668   Mean Streets of Gadgetzan
38319    Druid of the Claw   692      5     Druid      Common     Minion  TRUE         EX1_165   Classic
2230     Druid of the Fang   2048     5     Druid      Common     Minion  TRUE         GVG_080   Goblins vs Gnomes
2310     Druid of the Flame  2292     3     Druid      Common     Minion  TRUE         BRM_010   Blackrock Mountain
40402    Evolve              38266    1     Shaman     Rare       Spell   TRUE         OG_027    Whispers of the Old Gods
41409    Jade Idol           40372    1     Druid      Rare       Spell   TRUE         CFM_602   Mean Streets of Gadgetzan
468      Mark of Nature      151      3     Druid      Common     Spell   TRUE         EX1_155   Classic
41609    Nefarian            2261     9     Neutral    Legendary  Minion  TRUE         BRM_030   Blackrock Mountain
38113    Raven Idol          13335    1     Druid      Common     Spell   TRUE         LOE_115   League of Explorers
1161     Starfall            86       5     Druid      Rare       Spell   TRUE         NEW1_007  Classic
38710    Unstable Portal     1929     2     Mage       Rare       Spell   TRUE         GVG_003   Goblins vs Gnomes
38653    Wisp                179      0     Neutral    Common     Minion  TRUE         CS2_231   Classic
137      Wrath               836      2     Druid      Common     Spell   TRUE         EX1_154   Classic

We create a named list that can be used within recode():

# list values are correct ids
mislab_recode <- as.list(mislabelled$dbfId.y)
# list names are mislabelled ids
names(mislab_recode) <- mislabelled$dbfId.x
# intermediate objects no longer needed
rm(cards_used, missing_cards, cards_all_raw, mssng_cards)
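
How the mapping is applied to the deck data is not shown in this section. One possible approach, sketched here in base R with match() rather than recode(), is to look each ID up in the mapping and leave unmatched IDs untouched. The names toy_recode and fix_ids are hypothetical; the three ID pairs are copied from the table above.

```r
# Hypothetical lookup: names are the mislabelled IDs, values are the correct IDs
toy_recode <- c("40341" = 940, "2177" = 2009, "137" = 836)

# Hypothetical helper: swap any mislabelled ID for its collectible counterpart
fix_ids <- function(ids, lookup) {
    # position of each ID among the mislabelled IDs (NA if it is not mislabelled)
    pos <- match(ids, as.numeric(names(lookup)))
    ifelse(is.na(pos), ids, unname(lookup)[pos])
}

fix_ids(c(40341, 179, 137, 2009), toy_recode)
# 940 179 836 2009
```

Under these assumptions, the same helper could be applied to every card column, for example with lapply() over card_0 to card_29 and unlist(mislab_recode) as the lookup. A plain lookup also sidesteps the type matching that recode() expects between the input vector and its replacement values.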

Other objects

The following items may be used for plotting:

# class colours
class_colors <- c(
    "Druid" = "#FF7D0A",
    "Hunter" = "#228B22", #"#ABD473",
    "Mage" = "#40C7EB",
    "Paladin" = "#F58CBA",
    "Priest" = "#FFFFFF",
    "Rogue" = "#FFF569",
    "Shaman" = "#0070DE",
    "Warlock" = "#8787ED",
    "Warrior" = "#C79C6E",
    "Neutral" = "#777777"
)

# rarity colours
rarity_colors <- c(
    "Free" = NA,
    "Common" = "#000000",
    "Rare" = "#1E90FF",
    "Epic" = "#9932CC",
    "Legendary" = "#FFB90F"
)

# release dates of card sets
release_dates <- c(
    "Launch" = "2014-03-11", 
    "Naxxramas" = "2014-07-22", 
    "GvG" = "2014-12-08",
    "Blackrock" = "2015-04-02",
    "TGT" = "2015-08-24",
    "Explorers" = "2015-11-12",
    "Old Gods" = "2016-04-26",
    "Karazhan" = "2016-08-11",
    "Gadgetzan" = "2016-12-01"
)

# darker plot theme for this doc
theme_doc <- theme(panel.spacing = unit(1, "points"),
                   panel.grid.major.y = element_blank(), 
                   plot.title = element_text(size = 12),
                   axis.text.y = element_text(size = 9),
                   panel.background = element_rect(fill = "#C3C3C3"))
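
As an illustration of how these objects could be plugged into a plot, the sketch below maps a named colour vector to the fill scale. The deck counts in toy_counts are invented for the example, and only a subset of the class colours is repeated so that the block runs on its own.

```r
library(ggplot2)

# Subset of the class colours defined above
class_colors <- c("Druid" = "#FF7D0A", "Mage" = "#40C7EB", "Neutral" = "#777777")

# Hypothetical deck counts, for illustration only
toy_counts <- data.frame(
    cardClass = c("Druid", "Mage", "Neutral"),
    n_decks   = c(120, 95, 300)
)

p <- ggplot(toy_counts, aes(x = cardClass, y = n_decks, fill = cardClass)) +
    geom_col() +
    scale_fill_manual(values = class_colors) +   # colours are matched by class name
    coord_flip() +
    labs(x = NULL, y = "Number of decks", fill = "Class") +
    theme(panel.background = element_rect(fill = "#C3C3C3"))

p
```

Because the colour vector is named, scale_fill_manual() assigns each colour by class name rather than by position, so the mapping stays stable when a plot shows only some of the classes.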

Summary and Reflection

So far, we have looked at the cards that users tend to include in Standard-format decks for Ranked play. Standard is also the format used for official Hearthstone tournaments, which makes these popular cards highly visible to a wide audience. We have also looked at card popularity broken down by various categories, such as class, time period and card set.

Are there any limitations to the data that may have affected our analysis?

The major limitation of this data is that it only looks at decks submitted to a third-party website, which brings up the following issues:

  • A submitted deck may not necessarily be played in the game itself, either because other players think it is too weak, or because the user submitted a joke deck that contains absurd combinations of cards and is not meant to be taken seriously.
  • There is no data on how often decks and cards are actually played in the Ranked format.
  • Likewise, there is no information on how effective the decks are at winning games. While the rating attribute may reflect how strong other players consider a deck, it is also biased by the popularity of the user and by the date of submission:
    • A deck may be initially strong and highly rated, but as new cards are introduced and old cards are removed from Standard format, the deck may wane in strength, but users are unlikely to retract their votes by this point in time.

How can we expand on this analysis?

  • Examine popular combinations of cards that complement each other well.
  • Examine whether a deck's crafting cost (in dust) has any relation to its popularity (rating).